Compressive strength or compression strength is the capacity of a material or structure to withstand loads tending to reduce size, as opposed to tensile strength, which withstands loads tending to elongate. It is one of the most important engineering properties of concrete. It is standard industrial practice to classify concrete into grades, and a grade is simply the compressive strength of a concrete cube or cylinder. Cube or cylinder samples are usually tested under a compression testing machine to obtain the compressive strength of concrete.
Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.
The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw form (not scaled).
Summary Statistics:
Number of instances (observations): 1030
Number of Attributes: 9
Attribute breakdown: 8 quantitative input variables, and 1 quantitative output variable
Using the data available in the concrete.csv file, apply Feature Engineering methods to obtain 85% to 95% accuracy with a 95% confidence level.
The basic ingredients of concrete are Portland cement, water and aggregates (both fine and coarse). Reference: https://en.wikipedia.org/wiki/Concrete. These three ingredients form the basis of plain concrete.
Portland cement is the most common type of cement in general use. It consists of a mixture of calcium silicates (alite, belite), aluminates and ferrites: compounds that combine calcium, silicon, aluminum and iron in forms that react with water.
Water content also largely determines the strength and workability of concrete. The greater the amount of water, the higher the workability of the concrete (it becomes more fluid), but the lower its strength. If the water content is kept too low, however, the workability of the concrete also suffers.
Fine and coarse aggregates make up the bulk of a concrete mixture. The size distribution of the aggregate determines how much binder is required. Aggregate with a very even size distribution has the biggest gaps whereas adding aggregate with smaller particles tends to fill these gaps.
Admixtures are materials in the form of powder or fluids that are added to the concrete to give it certain characteristics not obtainable with plain concrete mixes. Superplasticizer is one such admixture, which has fewer deleterious effects and can be used to increase workability and compressive strength more than is practical with traditional plasticizers. It lowers the water requirement by 15–30%, though it can also have retarding effects.
Fly Ash and Blast Furnace Slag come under the category of mineral admixtures, which, owing to their chemical composition, are used to partially replace Portland cement by up to 60% and 80% by mass respectively. Mineral admixtures generally have latent hydraulic properties.
The compressive strength of concrete is calculated as the failure load divided by the cross-sectional area resisting the load, and is reported in pounds per square inch (psi) in US customary units and megapascals (MPa) in SI units. Concrete's compressive strength requirements can vary from 2500 psi (17 MPa) for residential concrete to 4000 psi (28 MPa) and higher in commercial structures. Higher strengths up to and exceeding 10,000 psi (70 MPa) are specified for certain applications.
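As a rough numeric illustration of the formula above (failure load divided by cross-sectional area), with made-up specimen dimensions and failure load:

```python
import math

def compressive_strength_mpa(failure_load_kn, diameter_mm):
    """Compressive strength of a cylindrical specimen in MPa (N/mm^2)."""
    area_mm2 = math.pi * (diameter_mm / 2) ** 2
    return failure_load_kn * 1000 / area_mm2  # kN -> N; N/mm^2 equals MPa

MPA_PER_PSI = 0.00689476  # conversion factor to compare against psi grades

# Hypothetical example: a 150 mm diameter cylinder failing at 500 kN
strength = compressive_strength_mpa(500, 150)
print(round(strength, 1), 'MPa')  # about 28 MPa, i.e. roughly the 4000 psi grade
```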
# Utilities
from time import time
import itertools
import warnings
import pydotplus
# Numerical calculation
import numpy as np
from scipy.stats import zscore
# Data handling
import pandas as pd
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from io import StringIO  # sklearn.externals.six is removed in newer scikit-learn versions
from IPython.display import Image
# Sample and parameter tuning
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
# Predictive Modeling
from sklearn.svm import SVR
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor
# Feature Engineering
from sklearn.utils import resample
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
# Evaluation metrics
from sklearn.metrics import mean_squared_error
# Configure for any default setting of any library
warnings.filterwarnings('ignore')
%matplotlib inline
# sns.set(style='whitegrid', palette='deep', font='sans-serif', font_scale=1.2, color_codes=True)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Comments
%matplotlib inline sets the backend of matplotlib to the 'inline' backend: with this backend, the output of plotting commands is displayed inline, without needing to call plt.show() every time data is plotted.
# Load the dataset into a Pandas dataframe called concrete
concrete = pd.read_csv('concrete.csv')
# Save an original copy of the dataframe
concrete_original = concrete.copy()
# Check the head of the dataset
concrete.head()
# Check the tail of the dataset
concrete.tail()
Comments
The dataset is divided into two parts, namely, feature matrix and the response vector.
# Get the shape and size of the dataset
print("Number of rows :",concrete.shape[0])
print("Number of columns :",concrete.shape[1])
# Get more info on it
# 1. Name of the columns
# 2. Find the data types of each columns
# 3. Look for any null/missing values
concrete.info()
# Check for missing values
concrete.applymap(np.isnan).sum(axis=0)
# Check for any Non-Real value present in the dataset such as '?' or '*' etc.
concrete[~concrete.applymap(np.isreal)].sum(axis=0)
Comments
np.isreal is a numpy function that checks whether each element is a real number and returns a bool array; .applymap is a pandas DataFrame function that applies np.isreal element-wise.
Observations
Let's explore the spread of data points, presence of any outliers for all attributes. We will be using the following plots for this analysis.
# Analyze the body of the distributions
cols = [i for i in concrete.columns if i not in 'strength']
fig = plt.figure(figsize=(17,10))
for i,j,k in itertools.zip_longest(cols, range(len(cols)), ["darkorange","r","g","c","m","k","lime","c"]):
plt.subplot(2,4,j+1)
ax = sns.distplot(concrete[i],color=k)
plt.axvline(concrete[i].mean(),linestyle="dashed",label="mean",color="k")
plt.legend(loc="best")
# Target feature "Compressive Strength" distribution
plt.figure(figsize=(13,6))
sns.set_style('darkgrid', {'axes.grid' : True})
sns.distplot(concrete["strength"],color="b",rug=True)
plt.axvline(concrete["strength"].mean(), linestyle="dashed", color="k", label='mean', linewidth=2)
plt.legend(loc="best",prop={"size":14})
Observations:
# Find out the count of 0s in each column
(concrete == 0).sum(axis=0)
Observations:
# Plot the central tendency of the dataset
_, bp = concrete.boxplot(return_type='both', figsize=(20,10), rot='vertical')
fliers = [flier.get_ydata() for flier in bp["fliers"]]
boxes = [box.get_ydata() for box in bp["boxes"]]
caps = [cap.get_ydata() for cap in bp['caps']]
whiskers = [whiskers.get_ydata() for whiskers in bp["whiskers"]]
# Describe the dataset with various other summary and statistics
conc_desc = concrete.describe()
conc_desc.loc['IQR'] = conc_desc.loc['75%'] - conc_desc.loc['25%']
for idx, col in enumerate(concrete.columns):
conc_desc.loc['min_whis',col] = caps[2 * idx][0]
conc_desc.loc['max_whis',col] = caps[2 * idx + 1][0]
conc_desc
Observations:
# Count the number of outlier data points present in each feature
for idx, col in enumerate(concrete.columns):
print(col, '(%d)--' % len(fliers[idx]), fliers[idx])
Min/Max Replacement: Outliers in the slag, water and fineagg features will be replaced by their nearest whisker ends in the central tendency. That is, data points more than $1.5*IQR$ below $Q_1$ will be replaced by the lower whisker end, and those more than $1.5*IQR$ above $Q_3$ by the upper whisker end.
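On a made-up toy series, the $1.5*IQR$ rule reads as follows (for simplicity, values beyond a fence are clipped to the fence itself rather than to the nearest observed whisker end used below):

```python
import pandas as pd

# Toy series (made up); 60 is an obvious high outlier
s = pd.Series([10, 12, 13, 14, 15, 16, 18, 60])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the whisker fences

# Values beyond the fences are pulled back to the fence values
capped = s.clip(lower=low, upper=high)
print(high, capped.max())  # 22.125 22.125
```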
NOTE:
# outliers capping for slag
q3 = concrete.slag.quantile(0.75)
high = q3 + 1.5*(q3 - concrete.slag.quantile(0.25))
concrete.loc[(concrete.slag > high), 'slag'] = caps[3][0]
# outliers capping for water
q3 = concrete.water.quantile(0.75)
high = q3 + 1.5*(q3 - concrete.water.quantile(0.25))
concrete.loc[(concrete.water > high), 'water'] = caps[7][0]
# outliers capping for fineagg
q3 = concrete.fineagg.quantile(0.75)
high = q3 + 1.5*(q3 - concrete.fineagg.quantile(0.25))
concrete.loc[(concrete.fineagg > high), 'fineagg'] = caps[13][0]
# Check the dataset after Outlier treatment
sns.set_style('darkgrid')
plt.figure(figsize=(15,10))
for idx, col in enumerate(concrete.columns):
plt.subplot(1, len(concrete.columns), idx+1)
sns.boxplot(y=concrete[col], palette='inferno')
plt.tight_layout()
# Function to calculate the age span based on the age column
conc_age_span = concrete.copy()
def age_span(age):
weeks = int(np.round(age/7))
if weeks <= 2:
return '1 week' if weeks <= 1 else '2 weeks'
else:
mnths = int(np.round(age/30))
return str(mnths) + ' month' if mnths == 1 else str(mnths) + ' months'
conc_age_span['age_span'] = conc_age_span['age'].apply(age_span)
conc_age_span.head()
# Get the list of unique age spans
conc_age_span['age_span'].unique()
# Visualize the Age distribution
order = ['1 week','2 weeks','1 month','2 months','3 months','4 months','6 months','9 months','12 months']
fig, axes = plt.subplots(1, 2, figsize=(16,6))
sns.countplot(conc_age_span.age_span, order=order, ax=axes[0])
explode = (0,0,0,0.1,0.2,0.3,0.4,0.5,0.6)
_ = axes[1].pie(conc_age_span.age_span.value_counts(), autopct='%1.1f%%', explode=explode, pctdistance=1.1)
axes[1].legend(labels=conc_age_span.age_span.value_counts().index,loc="best")
axes[1].axis('equal')
plt.tight_layout()
# Get the central tendency grouped by age span
age_span = conc_age_span.groupby("age_span")["strength"].describe().loc[order].reset_index()
age_span
# Visualize the compressive strength grouped by age span
cols = ['mean', 'min', 'max']
fig = plt.figure(figsize=(13,10))
for i,j,k in itertools.zip_longest(cols,range(len(cols)),['b','r','g']):
plt.subplot(3,1,j+1)
ax = sns.pointplot(x='age_span', y=i, data=age_span, order=order, markers='H', linestyles='dotted', color=k)
ax.set_facecolor('silver')
plt.title(i + ' - compressive strength by age span'); fig.tight_layout()
Observation:
Let's explore the relationship among the predictor variables and between the predictor variables and target column. We will be using the density curve plus histogram, scatterplot and pairplots for this analysis.
A pairplot visualizes the pairwise relationships between variables. It creates a square matrix over the continuous attributes of the dataset: the diagonal plots show the histogram and/or KDE plot of a single attribute, while the upper and lower triangular plots show the joint relationship between pairs of attributes.
# Create an instance of the PairGrid class.
grid = sns.PairGrid(data=concrete)
# Map a scatter plot to the lower triangle
grid = grid.map_lower(plt.scatter, edgecolor='white')
# Map a histogram to the diagonal
grid = grid.map_diag(plt.hist, bins = 10, color='r', edgecolor='k')
# Map a density plot to the upper triangle
grid = grid.map_upper(sns.kdeplot)
Observations:
# Analyze the regression line of each predictor vs strength
cols = [i for i in concrete.columns if i not in 'strength']
fig = plt.figure(figsize=(17,10))
for i,j,k in itertools.zip_longest(cols, range(len(cols)), ["darkorange","r","g","c","m","k","lime","c"]):
plt.subplot(2,4,j+1)
sns.regplot(x=i, y='strength', data=concrete, color=k, marker='.')
The regression lines for cement, water, superplastic and age appear to have broadly similar relationships with strength, but let's plot the correlation matrix to get Pearson's correlation coefficient for each.
# Visualize the correlation matrix
corr = concrete.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(12,10))
with sns.axes_style("white"):
sns.heatmap(corr,annot=True,linewidth=2, mask=mask, cmap="BrBG")
plt.title("Correlation between variables")
Observation:
Feature Generation is the feature-engineering technique of deriving additional features from the raw data that may exhibit a stronger relationship with the target variable.
# Feature 1 - Water to Cement ratio
np.corrcoef(x=concrete.cement/concrete.water, y=concrete.strength)
# Feature 2 - Cement to Coarse Aggregate ratio
np.corrcoef(x=concrete.cement/concrete.coarseagg, y=concrete.strength)
# Feature 3 - Cement to Fine Aggregate ratio
np.corrcoef(x=concrete.cement/concrete.fineagg, y=concrete.strength)
# Feature 4 - Cement to Aggregates ratio
np.corrcoef(x=concrete.cement/(concrete.coarseagg+concrete.fineagg), y=concrete.strength)
# Feature 5 - Admixtures to Cement ratio
np.corrcoef(x=(concrete.superplastic + concrete.slag + concrete.ash)/concrete.cement, y=concrete.strength)
Observations:
Since none of the composite features shows a strong correlation with strength, let's go ahead and explore polynomial features.
# Instantiate a PolynomialFeatures object of 2nd degree (interaction terms only)
poly = PolynomialFeatures(degree=2, interaction_only=True)
# Fit and transform the X or input features
concrete_poly = poly.fit_transform(concrete.drop('strength', axis=1))
# Get the shape of the newly generated features
print("Number of rows :",concrete_poly.shape[0])
print("Number of columns :",concrete_poly.shape[1])
Observation:
# Join the strength column to create a polynomial dataset
concrete_poly = pd.DataFrame(concrete_poly,columns=poly.get_feature_names())
concrete_poly['strength'] = concrete['strength']
concrete_poly.shape
# Visualize the correlation matrix
plt.figure(figsize=(12,8))
sns.heatmap(concrete_poly.corr(), cmap="PRGn")
# Check the correlation coefficient of all the polynomial features with strength
poly_corrcoef = [np.corrcoef(concrete_poly[col], concrete_poly.strength)[0,1] for col in concrete_poly.columns[:-1]]
pd.DataFrame(poly_corrcoef, index = concrete_poly.columns[:-1],
columns=['Corr_Coef']).plot(kind='bar', figsize=(15,7), title='Corr-Coef of Polynomial Features', color='peru')
Observations:
Feature Selection is the feature-engineering technique of analyzing the feature space to retain as much explanatory variance as possible while discarding unrelated dimensions.
Standardization with z-scores is the most commonly used method. It converts all indicators to a common scale with a mean of 0 and a standard deviation of 1. Most of the features in our dataset are measured in kg/m3, while age and strength are measured in days and MPa respectively, so we apply the z-score to bring them onto a single scale.
concrete = concrete.apply(zscore)
concrete.describe()
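As a sanity check on what `apply(zscore)` just did: scipy's `zscore` standardizes with the population standard deviation (ddof=0), so on a made-up toy column it matches the manual calculation:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up toy column
x = pd.Series([540.0, 332.5, 198.6, 266.0, 475.0])

manual = (x - x.mean()) / x.std(ddof=0)  # zscore uses population std (ddof=0) by default
assert np.allclose(manual, zscore(x))
print(round(abs(manual.mean()), 6), round(manual.std(ddof=0), 6))  # 0.0 1.0
```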
# Create a covariance matrix and calculate eigen values
pca = PCA().fit(concrete.drop('strength', axis=1))
# calculate variance ratios
var = pca.explained_variance_ratio_;var
# cumulative sum of variance explained with [n] features
eigen_vals = np.round(pca.explained_variance_ratio_, decimals=3)*100
np.cumsum(eigen_vals)
threshold=97
def generate_scree_plot(covar_matrix, threshold):
var = covar_matrix.explained_variance_
eigen_vals = np.cumsum(np.round(covar_matrix.explained_variance_ratio_, decimals=3)*100)
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20,7))
f.suptitle('PCA Scree plot')
ax1.plot(np.arange(1, len(var)+1), var, '-go')
ax1.set_xticks(np.arange(1, len(var)+1))
ax1.set_title('Explained Variance')
ax1.set_xlabel('# of Components')
ax1.set_ylabel('Eigen Values')
ax2.plot(np.arange(1, len(eigen_vals)+1), eigen_vals, ':k', marker='o', markerfacecolor='red', markersize=8)
ax2.set_xticks(np.arange(1, len(eigen_vals)+1))
ax2.axhline(y=threshold, color='r', linestyle=':', label=f'Threshold ({threshold}%)')
ax2.legend()
ax2.plot(np.arange(sum(eigen_vals <= threshold) + 1, len(eigen_vals) + 1),
[val for val in eigen_vals if val > threshold], '-bo')
ax2.set_title('Cumulative sum Explained Variance Ratio')
ax2.set_xlabel('# of Components')
ax2.set_ylabel('% Variance Explained')
generate_scree_plot(pca, threshold=threshold)
plt.figure(figsize=(15,7))
plt.axhline(y=threshold, color='r', linestyle=':', label='97% Threshold')
plt.bar(np.arange(1, len(eigen_vals) + 1), eigen_vals, label='Eigen Values')
plt.plot(np.arange(1, len(np.cumsum(eigen_vals))+1), np.cumsum(eigen_vals),
drawstyle='steps-mid', label='Cumulative Eigen Values')
plt.yticks(np.arange(0,105,5))
_, _ = plt.xticks(np.arange(1,9,1)), plt.xlabel('# of Components')
plt.legend()
Observations:
# Create a new matrix using the n components
X = concrete.drop('strength', axis=1)
X_proj = PCA(n_components=6).fit_transform(X)
y = concrete.strength
X_proj.shape
Train Test Split
Divide both the original and PCA projected datasets into 80:20 ratio for train and test respectively.
# Divide the original and the projected dataset into an 80:20 ratio for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_proj_train, X_proj_test, y_train, y_test = train_test_split(X_proj, y, test_size=0.2, random_state=1)
print('Original dimensions for train test split:\n', X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print('\nProjected dimensions for train test split:\n', X_proj_train.shape, X_proj_test.shape, y_train.shape, y_test.shape)
Linear models are the simplest parametric methods. A regression is a prediction where the target is continuous in nature.
We are going to apply Standard Linear Regression and Regularized Linear Regression to both the original and the projected dataset to verify whether the predictors have a linear relationship with the target variable strength.
An important metric used in regression is the coefficient of determination, or R-squared. It measures the proportion of the variance in the target that is explained by the model: $$R^2=1-\frac{\sum_i{r_i^2}}{\sum_i{(y_i-\bar{y})^2}}$$ where $r_i = y_i - \hat{y}_i$ is the residual of the $i$-th prediction.
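The R-squared formula above can be verified numerically against sklearn's scorer on made-up values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
assert np.isclose(r2, r2_score(y_true, y_pred))
print(r2)  # 0.925
```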
# Fit the LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
# Print the coefficients of each attributes
lr.coef_
# Calculate the score of Linear Regression
print('Training score :', lr.score(X_train, y_train))
print('Testing score :', lr.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))
# Calculate Cross Validation Score
lr_cv = cross_val_score(lr, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",lr_cv.mean())
print ("cv-std :",lr_cv.std())
print ("cv-max :",lr_cv.max())
print ("cv-min :",lr_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(lr.predict(X_test)).plot(label = "predictions",figsize=(17,5),color='darkorange')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
# Fit the LinearRegression
lr_proj = LinearRegression()
lr_proj.fit(X_proj_train, y_train)
# Print the coefficients of each attributes
lr_proj.coef_
# Calculate the score of Linear Regression
print('Training score :', lr_proj.score(X_proj_train, y_train))
print('Testing score :', lr_proj.score(X_proj_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, lr_proj.predict(X_proj_test))))
# Calculate Cross Validation Score
lr_proj_cv = cross_val_score(lr_proj, X_proj_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",lr_proj_cv.mean())
print ("cv-std :",lr_proj_cv.std())
print ("cv-max :",lr_proj_cv.max())
print ("cv-min :",lr_proj_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(lr_proj.predict(X_proj_test)).plot(label = "predictions",figsize=(17,5),color='r')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Observations:
Let's apply regularization technique to see how the performance of the linear model gets impacted.
Linear models, being parametric, tend to contribute more towards bias error. Regularization trades a small increase in bias for a reduction in variance, so we can find a sweet spot where the model's performance on unseen data remains promising. Two such regularized linear models are:
# Fit the Ridge regularised linear model
rg = Ridge(alpha=0.1)
rg.fit(X_train, y_train)
# Print the coefficients of each attributes
rg.coef_
# Calculate the score of Linear Regression
print('Training score :', rg.score(X_train, y_train))
print('Testing score :', rg.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, rg.predict(X_test))))
# Calculate Cross Validation Score
rg_cv = cross_val_score(rg, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",rg_cv.mean())
print ("cv-std :",rg_cv.std())
print ("cv-max :",rg_cv.max())
print ("cv-min :",rg_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(rg.predict(X_test)).plot(label = "predictions",figsize=(17,5),color='g')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
# Fit the Ridge regularised linear model
rg_proj = Ridge(alpha=0.1)
rg_proj.fit(X_proj_train, y_train)
# Print the coefficients of each attributes
rg_proj.coef_
# Calculate the score of Linear Regression
print('Training score :', rg_proj.score(X_proj_train, y_train))
print('Testing score :', rg_proj.score(X_proj_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, rg_proj.predict(X_proj_test))))
# Calculate Cross Validation Score
rg_proj_cv = cross_val_score(rg_proj, X_proj_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",rg_proj_cv.mean())
print ("cv-std :",rg_proj_cv.std())
print ("cv-max :",rg_proj_cv.max())
print ("cv-min :",rg_proj_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(rg_proj.predict(X_proj_test)).plot(label = "predictions",figsize=(17,5),color='c')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
# Fit the Lasso regularised linear model
ls = Lasso(alpha=0.1)
ls.fit(X_train, y_train)
# Print the coefficients of each attributes
ls.coef_
# Calculate the score of Linear Regression
print('Training score :', ls.score(X_train, y_train))
print('Testing score :', ls.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, ls.predict(X_test))))
# Calculate Cross Validation Score
ls_cv = cross_val_score(ls, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",ls_cv.mean())
print ("cv-std :",ls_cv.std())
print ("cv-max :",ls_cv.max())
print ("cv-min :",ls_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(ls.predict(X_test)).plot(label = "predictions",figsize=(17,5),color='k')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
# Fit the Lasso regularised linear model
ls_proj = Lasso(alpha=0.1)
ls_proj.fit(X_proj_train, y_train)
# Print the coefficients of each attributes
ls_proj.coef_
# Calculate the score of Linear Regression
print('Training score :', ls_proj.score(X_proj_train, y_train))
print('Testing score :', ls_proj.score(X_proj_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, ls_proj.predict(X_proj_test))))
# Calculate Cross Validation Score
ls_proj_cv = cross_val_score(ls_proj, X_proj_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",ls_proj_cv.mean())
print ("cv-std :",ls_proj_cv.std())
print ("cv-max :",ls_proj_cv.max())
print ("cv-min :",ls_proj_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(ls_proj.predict(X_proj_test)).plot(label = "predictions",figsize=(17,5),color='m')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Observations:
Hence, to conclude: determining the compressive strength of concrete from its ingredient quantities is not, as a whole, a problem that can be addressed by the linear family of algorithms. The model complexity required calls for non-parametric methods.
We noticed in the bivariate analysis section that a few features in the dataset have distributions that look like mixtures of multiple Gaussians. Let's apply a clustering technique to find whether there are any distinguishable clusters and assess their usefulness for compressive strength prediction.
K-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms for clustering. It identifies k centroids and then allocates every data point to its nearest cluster, keeping the within-cluster distances as small as possible. The algorithm works as follows:
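The loop can be sketched from scratch on a tiny made-up 2-D dataset (a minimal version of Lloyd's algorithm, not the sklearn implementation used below):

```python
import numpy as np

# Tiny made-up 2-D dataset with two obvious groups
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
k = 2
centroids = X[[0, 3]]  # step 1: choose k initial centroids (here: fixed picks)

for _ in range(10):
    # step 2: assign every point to its nearest centroid
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # step 3: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # step 4: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)  # [0 0 0 1 1 1]
```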
The elbow method is the technique used to find an optimized value of k. In the pairplot we observed two Gaussians in each of the three features slag, ash and superplastic; hence we will analyze the data for more than 6 clusters, taking up to 14.
k_range = range(1,15)
kmeans = [KMeans(n_clusters=n) for n in k_range]
sse = [kmeans[i].fit(concrete).inertia_ for i in range(len(kmeans))]
plt.figure(figsize=(12,6))
plt.plot(k_range, sse, marker='o')
plt.title('Elbow Plot')
_ = plt.xticks(k_range)
Observations:
# Initialize 6 centroid clusters and fit the dataset
kmeans = KMeans(n_clusters= 6, random_state=1)
kmeans.fit(concrete)
# Find out the count of observations in each clusters
labels = kmeans.labels_
plt.figure(figsize=(12,6))
plt.bar(np.unique(labels), np.bincount(labels), color='lightcyan', edgecolor='red')
plt.xlabel('Clusters')
plt.ylabel('Bin Counts')
## Creating a new dataframe only for labels and converting it into categorical variable
cluster_labels = pd.DataFrame(labels , columns = list(['labels']))
cluster_labels['labels'] = cluster_labels['labels'].astype('category')
# Join the labels with original dataset
concrete_labeled = concrete.join(cluster_labels)
_ = concrete_labeled.boxplot(by='labels', layout=(3,3), figsize=(20,13))
Observations:
Obtain feature importance for the individual features using multiple methods and present your findings.
We will analyze the following non-parametric models and check the accuracy scores with each one of them.
From Ensemble family:
svr = SVR(C=1, kernel='linear')
svr.fit(X_train, y_train)
# Calculate the score of Support Vector Regressor
print('Training score :', svr.score(X_train, y_train))
print('Testing score :', svr.score(X_test, y_test))
svr = SVR(kernel='rbf')
svr.fit(X_train, y_train)
# Calculate the score of Support Vector Regressor
print('Training score :', svr.score(X_train, y_train))
print('Testing score :', svr.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, svr.predict(X_test))))
# Calculate Cross Validation Score
svr_cv = cross_val_score(svr, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",svr_cv.mean())
print ("cv-std :",svr_cv.std())
print ("cv-max :",svr_cv.max())
print ("cv-min :",svr_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(svr.predict(X_test)).plot(label = "predictions",figsize=(17,5),color='c')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Observations:
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
# Calculate the score of Decision Tree Regressor
print('Training score :', dtr.score(X_train, y_train))
print('Testing score :', dtr.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, dtr.predict(X_test))))
# Calculate Cross Validation Score
dtr_cv = cross_val_score(dtr, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",dtr_cv.mean())
print ("cv-std :",dtr_cv.std())
print ("cv-max :",dtr_cv.max())
print ("cv-min :",dtr_cv.min())
Observations:
This is the nature of a decision tree: unless pruned, it tends to grow fully, down to the deepest leaf nodes. Let's prune the decision tree in iteration 2.
dtr = DecisionTreeRegressor(max_depth=6, max_leaf_nodes=40, min_samples_leaf=9)
dtr.fit(X_train, y_train)
# Calculate the score of Decision Tree Regressor
print('Training score :', dtr.score(X_train, y_train))
print('Testing score :', dtr.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, dtr.predict(X_test))))
# Calculate Cross Validation Score
dtr_cv = cross_val_score(dtr, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",dtr_cv.mean())
print ("cv-std :",dtr_cv.std())
print ("cv-max :",dtr_cv.max())
print ("cv-min :",dtr_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(dtr.predict(X_test)).plot(label = "predictions",figsize=(17,5),color='b')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Observations:
dot_data = StringIO()
export_graphviz(dtr, out_file=dot_data, filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
# Feature Importance
dtr_df_fi = pd.DataFrame(dtr.feature_importances_, index = concrete.columns[:-1], columns=['Importance'])
dtr_df_fi.sort_values('Importance', ascending=False)
# Visualize the Feature Importances
dtr_df_fi.sort_values('Importance').plot(kind='barh', figsize=(10,5), color='b', title='Feature Importance')
The coarseagg, superplastic, fineagg and ash features are identified as less important. Let's check the model performance after dropping them in iteration 3.
labels = ['coarseagg', 'superplastic', 'fineagg', 'ash']
X_dtr_fi_train = X_train.drop(labels=labels, axis=1)
X_dtr_fi_test = X_test.drop(labels=labels, axis=1)
dtr_fi = DecisionTreeRegressor(max_depth=6, max_leaf_nodes=40, min_samples_leaf=9)
dtr_fi.fit(X_dtr_fi_train, y_train)
# Calculate the score of Decision Tree Regressor
print('Training score :', dtr_fi.score(X_dtr_fi_train, y_train))
print('Testing score :', dtr_fi.score(X_dtr_fi_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, dtr_fi.predict(X_dtr_fi_test))))
# Calculate Cross Validation Score
dtr_fi_cv = cross_val_score(dtr_fi, X_dtr_fi_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",dtr_fi_cv.mean())
print ("cv-std :",dtr_fi_cv.std())
print ("cv-max :",dtr_fi_cv.max())
print ("cv-min :",dtr_fi_cv.min())
Observations:
Let's analyze the ensemble techniques.
Ensembling is the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the model.
Advantages of ensemble methods:
Bagging or Bootstrap Aggregating is an ensembling algorithm that can be used for both classification (BaggingClassifier) and regression (BaggingRegressor) problems. It follows the typical bagging technique to make predictions. Decision Tree is used as the default base estimator unless specified otherwise. Following are the steps for the bagging meta-estimator algorithm:
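Those steps can be sketched by hand with `resample` (already imported above) and decision trees, using synthetic stand-in data rather than the concrete features, before relying on sklearn's `BaggingRegressor` directly:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

# Synthetic stand-in data (made up)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=200)

trees = []
for seed in range(25):
    # 1. draw a bootstrap sample (with replacement) of the training set
    X_bs, y_bs = resample(X, y, random_state=seed)
    # 2. fit one base estimator (a decision tree, as in BaggingRegressor) per sample
    trees.append(DecisionTreeRegressor(random_state=seed).fit(X_bs, y_bs))

# 3. aggregate: average the individual trees' predictions
y_pred = np.mean([t.predict(X) for t in trees], axis=0)
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
print(rmse)  # ensemble RMSE on the training data
```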
bag = BaggingRegressor()
bag.fit(X_train, y_train)
# Calculate the score of Bagging Regressor
print('Training score :', bag.score(X_train, y_train))
print('Testing score :', bag.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, bag.predict(X_test))))
# Calculate Cross Validation Score
bag_cv = cross_val_score(bag, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",bag_cv.mean())
print ("cv-std :",bag_cv.std())
print ("cv-max :",bag_cv.max())
print ("cv-min :",bag_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(bag.predict(X_test)).plot(label = "predictions",figsize=(17,5),color='m')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Bagging ensemble with Support Vector Regressor as the base estimator. Since SVR scored higher than the Decision Tree regressor, let's check its performance as the base estimator for Bagging.
bag_svr = BaggingRegressor(base_estimator=SVR())  # 'base_estimator' was renamed 'estimator' in scikit-learn >= 1.2
bag_svr.fit(X_train, y_train)
# Calculate the score of the Bagging Regressor (SVR base estimator)
print('Training score :', bag_svr.score(X_train, y_train))
print('Testing score :', bag_svr.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, bag_svr.predict(X_test))))
# Calculate Cross Validation Score
bag_svr_cv = cross_val_score(bag_svr, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",bag_svr_cv.mean())
print ("cv-std :",bag_svr_cv.std())
print ("cv-max :",bag_svr_cv.max())
print ("cv-min :",bag_svr_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(bag_svr.predict(X_test)).plot(label = "predictions",figsize=(17,5),color='k')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Random Forest is another ensemble machine learning algorithm that follows the bagging technique. It is an extension of the bagging estimator algorithm. The base estimators in random forest are decision trees. Unlike bagging, random forest randomly selects a set of features which are used to decide the best split at each node of the decision tree.
The Random Forest algorithm proceeds as follows: draw bootstrap samples from the training data, grow one decision tree per sample while considering only a random subset of features at each split, and average the trees' predictions.
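These steps can be sketched as below. The hypothetical `manual_forest_predict` helper is for illustration only; `RandomForestRegressor` handles all of this internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def manual_forest_predict(X_train, y_train, X_test, n_estimators=10, seed=0):
    """Illustrative random forest: bagging plus a random feature
    subset considered at every split."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_estimators):
        # 1. Bootstrap sample of the training rows (as in plain bagging)
        idx = rng.integers(0, len(X_train), size=len(X_train))
        # 2. Grow a tree; max_features='sqrt' is the random-feature step
        #    that decorrelates the trees (the difference from plain bagging)
        tree = DecisionTreeRegressor(max_features='sqrt',
                                     random_state=int(rng.integers(1 << 16)))
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    # 3. Average the per-tree predictions
    return np.mean(preds, axis=0)
```

Restricting each split to a random feature subset makes the individual trees less correlated, so averaging them reduces variance more effectively than bagging identical trees.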
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
# Calculate the score of the Random Forest Regressor
print('Training score :', rfr.score(X_train, y_train))
print('Testing score :', rfr.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, rfr.predict(X_test))))
# Calculate Cross Validation Score
rfr_cv = cross_val_score(rfr, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",rfr_cv.mean())
print ("cv-std :",rfr_cv.std())
print ("cv-max :",rfr_cv.max())
print ("cv-min :",rfr_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(rfr.predict(X_test)).plot(label = "predictions",figsize=(17,5),color='crimson')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Observations:
Let's analyze the Feature Importance.
# Feature Importance
rfr_df_fi = pd.DataFrame(rfr.feature_importances_, index = concrete.columns[:-1], columns=['Importance'])
rfr_df_fi.sort_values('Importance', ascending=False)
# Visualize the Feature Importances
rfr_df_fi.sort_values('Importance').plot(kind='barh', figsize=(10,5), color='crimson', title='Feature Importance')
The superplastic, fineagg, coarseagg and ash features are identified as the least important. Let's check the model performance by dropping them in Iteration 2.
labels = ['superplastic', 'fineagg', 'coarseagg', 'ash']
X_rfr_fi_train = X_train.drop(labels=labels, axis=1)
X_rfr_fi_test = X_test.drop(labels=labels, axis=1)
rfr_fi = RandomForestRegressor()
rfr_fi.fit(X_rfr_fi_train, y_train)
# Calculate the score of the Random Forest Regressor
print('Training score :', rfr_fi.score(X_rfr_fi_train, y_train))
print('Testing score :', rfr_fi.score(X_rfr_fi_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, rfr_fi.predict(X_rfr_fi_test))))
# Calculate Cross Validation Score
rfr_fi_cv = cross_val_score(rfr_fi, X_rfr_fi_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",rfr_fi_cv.mean())
print ("cv-std :",rfr_fi_cv.std())
print ("cv-max :",rfr_fi_cv.max())
print ("cv-min :",rfr_fi_cv.min())
Observations:
Gradient Boosting is another ensemble machine learning algorithm that works for both regression and classification problems. GBM uses the boosting technique, combining a number of weak learners to form a strong learner. Regression trees are used as the base learners; each subsequent tree in the series is fit to the errors (residuals) left by the previous trees.
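The residual-fitting idea can be sketched as follows for squared-error loss. The hypothetical `manual_gbm_predict` helper is a minimal illustration, not the library implementation (`GradientBoostingRegressor` adds line search, subsampling, and other refinements).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def manual_gbm_predict(X_train, y_train, X_test, n_estimators=50, learning_rate=0.1):
    """Illustrative gradient boosting with squared-error loss:
    each new tree is fit to the residuals of the current ensemble."""
    # Initial prediction: the mean (the constant that minimizes squared error)
    f_train = np.full(len(y_train), y_train.mean())
    f_test = np.full(len(X_test), y_train.mean())
    for _ in range(n_estimators):
        residuals = y_train - f_train                     # errors of the current model
        tree = DecisionTreeRegressor(max_depth=3).fit(X_train, residuals)
        f_train += learning_rate * tree.predict(X_train)  # shrunken update
        f_test += learning_rate * tree.predict(X_test)
    return f_test
```

The learning rate shrinks each tree's contribution, trading more estimators for better generalization.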
gbm = GradientBoostingRegressor()
gbm.fit(X_train, y_train)
# Calculate the score of the Gradient Boosting Regressor
print('Training score :', gbm.score(X_train, y_train))
print('Testing score :', gbm.score(X_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, gbm.predict(X_test))))
# Calculate Cross Validation Score
gbm_cv = cross_val_score(gbm, X_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",gbm_cv.mean())
print ("cv-std :",gbm_cv.std())
print ("cv-max :",gbm_cv.max())
print ("cv-min :",gbm_cv.min())
Observations:
Let's analyze the Feature Importance.
# Feature Importance
gbm_df_fi = pd.DataFrame(gbm.feature_importances_, index = concrete.columns[:-1], columns=['Importance'])
gbm_df_fi.sort_values('Importance', ascending=False)
# Visualize the Feature Importances
gbm_df_fi.sort_values('Importance').plot(kind='barh', figsize=(10,5), color='blueviolet', title='Feature Importance')
Only the ash feature comes out as unimportant. Let's check the model performance by dropping it in Iteration 2.
labels = ['ash']
X_gbm_fi_train = X_train.drop(labels=labels, axis=1)
X_gbm_fi_test = X_test.drop(labels=labels, axis=1)
gbm_fi = GradientBoostingRegressor()
gbm_fi.fit(X_gbm_fi_train, y_train)
# Calculate the score of the Gradient Boosting Regressor
print('Training score :', gbm_fi.score(X_gbm_fi_train, y_train))
print('Testing score :', gbm_fi.score(X_gbm_fi_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, gbm_fi.predict(X_gbm_fi_test))))
# Calculate Cross Validation Score
gbm_fi_cv = cross_val_score(gbm_fi, X_gbm_fi_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",gbm_fi_cv.mean())
print ("cv-std :",gbm_fi_cv.std())
print ("cv-max :",gbm_fi_cv.max())
print ("cv-min :",gbm_fi_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':')
ax = pd.Series(gbm_fi.predict(X_gbm_fi_test)).plot(label = "predictions",figsize=(17,5),color='blueviolet')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Observations:
So, we are going to consider the Gradient Boosting Regressor as our best-suited model, with ash removed from the feature space. Let's tune the model and check its performance at the 95% confidence level.
As discussed earlier, the Gradient Boosting Regressor is the best-performing algorithm among all, so we choose it to proceed further with model tuning.
Let's perform the grid search using scikit-learn's GridSearchCV, which stands for grid search cross-validation. By default, GridSearchCV uses 5-fold cross-validation (3-fold in scikit-learn versions before 0.22), with KFold or StratifiedKFold depending on the estimator.
# Run GridSearch to tune the hyper-parameter
st = time()
k_fold_cv = 5 # 5-fold cross validation
grid_params = {
"loss":["ls", "lad"],
"learning_rate": [0.075, 0.1, 0.15],
"min_samples_split": np.linspace(0.1, 0.5, 5),
"min_samples_leaf": np.linspace(0.1, 0.5, 5),
"max_depth": [3,5,8],
"max_features":["log2","sqrt"],
"criterion": ["friedman_mse", "mae"],
"subsample":[0.9, 0.95, 1.0]
}
grid = GridSearchCV(GradientBoostingRegressor(n_estimators=5), param_grid=grid_params, cv=k_fold_cv,
n_jobs = 1, verbose = 0, return_train_score=True)
grid.fit(X_gbm_fi_train, y_train)
print('Time taken %.2fs to tune the best hyper-parameter for Gradient Boosting Regressor' % (time()-st))
# Use the tuned estimator from GridSearch to run the model
gbm_grid = grid.best_estimator_
gbm_grid.fit(X_gbm_fi_train, y_train)
# Calculate the score of Tuned Gradient Boosting Regressor
print('Training score :', gbm_grid.score(X_gbm_fi_train, y_train))
print('Testing score :', gbm_grid.score(X_gbm_fi_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, gbm_grid.predict(X_gbm_fi_test))))
# Calculate Cross Validation Score
gbm_grid_cv = cross_val_score(gbm_grid, X_gbm_fi_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",gbm_grid_cv.mean())
print ("cv-std :",gbm_grid_cv.std())
print ("cv-max :",gbm_grid_cv.max())
print ("cv-min :",gbm_grid_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':',color='r')
ax = pd.Series(gbm_grid.predict(X_gbm_fi_test)).plot(label = "predictions",figsize=(17,5),color='springgreen')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Let's perform the random search using scikit-learn's RandomizedSearchCV. By default, RandomizedSearchCV also uses 5-fold cross-validation (3-fold in scikit-learn versions before 0.22), with KFold or StratifiedKFold depending on the estimator.
# Run RandomizedSearchCV to tune the hyper-parameter
st = time()
k_fold_cv = 5 # 5-fold cross validation
params = {
"loss":["ls", "lad"],
"learning_rate": [0.075, 0.1, 0.15],
"min_samples_split": np.linspace(0.1, 0.5, 5),
"min_samples_leaf": np.linspace(0.1, 0.5, 5),
"max_depth": [3,5,8],
"max_features":["log2","sqrt"],
"criterion": ["friedman_mse", "mae"],
"subsample":[0.9, 0.95, 1.0]
}
random = RandomizedSearchCV(GradientBoostingRegressor(n_estimators=5), param_distributions=params, cv=k_fold_cv,
n_iter = 5, scoring='neg_mean_absolute_error',verbose=2, random_state=42,
n_jobs=-1, return_train_score=True)
random.fit(X_gbm_fi_train, y_train)
print('Time taken %.2fs to tune the best hyper-parameter for Gradient Boosting Regressor by Random Search.' % (time()-st))
# Use the tuned estimator from RandomizedSearchCV to run the model
gbm_random = random.best_estimator_
gbm_random.fit(X_gbm_fi_train, y_train)
# Calculate the score of Tuned Gradient Boosting Regressor
print('Training score :', gbm_random.score(X_gbm_fi_train, y_train))
print('Testing score :', gbm_random.score(X_gbm_fi_test, y_test))
# Calculate RMSE
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, gbm_random.predict(X_gbm_fi_test))))
# Calculate Cross Validation Score
gbm_random_cv = cross_val_score(gbm_random, X_gbm_fi_test, y_test, cv=20, scoring='neg_mean_squared_error').ravel()
print ("cv-mean :",gbm_random_cv.mean())
print ("cv-std :",gbm_random_cv.std())
print ("cv-max :",gbm_random_cv.max())
print ("cv-min :",gbm_random_cv.min())
ax = y_test.reset_index()['strength'].plot(label="originals",figsize=(17,5), linestyle=':',color='r')
ax = pd.Series(gbm_random.predict(X_gbm_fi_test)).plot(label = "predictions",figsize=(17,5),color='springgreen')
plt.legend(loc="best")
plt.title("ORIGINALS VS PREDICTIONS")
plt.xlabel("index")
plt.ylabel("values")
Observations:
Let's take the best estimator from GridSearchCV and apply a resampling technique to estimate its accuracy within the desired confidence interval.
The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
It can be used to estimate summary statistics such as the mean or standard deviation. It is used in applied machine learning to estimate the skill of machine learning models when making predictions on data not included in the training data.
A desirable property of the results from estimating machine learning model skill is that the estimated skill can be presented with confidence intervals, a feature not readily available with other methods such as cross-validation.
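The percentile method used below can be illustrated on a simple statistic such as the mean. The hypothetical `bootstrap_ci` helper is for illustration only; the notebook's own loop bootstraps the model's score instead.

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_iterations=1000, alpha=0.95, seed=0):
    """Percentile bootstrap confidence interval for a summary statistic."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Recompute the statistic on many resamples drawn with replacement
    stats = [stat(rng.choice(values, size=len(values), replace=True))
             for _ in range(n_iterations)]
    # Cut off (1 - alpha)/2 of the bootstrap distribution in each tail
    lower = np.percentile(stats, (1.0 - alpha) / 2.0 * 100)            # 2.5th for alpha=0.95
    upper = np.percentile(stats, (alpha + (1.0 - alpha) / 2.0) * 100)  # 97.5th
    return lower, upper
```

The same percentile cut applied to bootstrapped model scores gives the confidence interval reported at the end of this section.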
# Bootstrap Sampling
st = time()
values = concrete.drop('ash', axis=1).values
n_iterations = 100 # Number of bootstrap samples to create
n_size = int(len(concrete)) # size of a bootstrap sample
# run bootstrap
stats = list() # empty list that will hold the scores for each bootstrap iteration
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size) # Sampling with replacement
    # picking the rest of the data not considered in the sample (out-of-bag rows)
    test = np.array([x for x in values if x.tolist() not in train.tolist()])
    # fit model
    gbm_boot = grid.best_estimator_
    gbm_boot.fit(train[:,:-1], train[:,-1]) # fit against independent variables and corresponding target values
    y_test = test[:,-1] # Take the target column for all rows in test set
    # evaluate model
    predictions = gbm_boot.predict(test[:, :-1]) # predict based on independent variables in the test data
    score = gbm_boot.score(test[:, :-1], y_test)
    stats.append(score)
print('Time taken %.2fs for resampling' % (time()-st))
# Plot confidence Interval
plt.figure(figsize=(20,5))
# plt.hist(stats)
sns.distplot(stats, bins=11)  # distplot is deprecated in seaborn >= 0.11; sns.histplot(stats, bins=11, kde=True) is the replacement
alpha = 0.95 # for 95% confidence interval
p = ((1.0-alpha)/2.0) * 100 # lower-tail percentile; 2.5% in each tail for alpha=0.95
lower = max(0.0, np.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(stats, p))
plt.axvline(x=lower, color='r', linestyle=':')
plt.axvline(x=upper, color='r', linestyle=':', label='Confidence Interval Threshold')
plt.legend(loc="best",prop={"size":14})
print("With %.1f confidence interval, Gradient Boosting's score varies between %.1f%% and %.1f%%" %
(alpha*100, lower*100, upper*100))
Summing up all the observations, and drawing on background knowledge of civil engineering, it is evident that: